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Abstract 

Compared with microarray-based genotyping, next-generation whole genome sequencing (WGS) studies have the 
strength to provide greater information for the identification of rare variants, which likely account for a significant 
portion of missing heritability of common human diseases. In WGS, family-based studies are important because 
they are likely enriched for rare disease variants that segregate with the disease in relatives. We propose a 
multilevel model to detect disease variants using family-based WGS data with longitudinal measures. This model 
incorporates the correlation structure from family pedigrees and that from repeated measures. The iterative 
generalized least squares algorithm was applied to estimation of parameters and test of associations. The model 
was applied to the data of Genetic Analysis Workshop 18 and compared with existing linear mixed-effect models. 
The multilevel model shows higher power at practical p-value levels and a better type I error control than linear 
mixed-effect model. Both multilevel and linear mixed-effect models, which use the longitudinal repeated 
information, have higher power than the methods that only use data collected at one time point. 



Background 

Whole genome sequencing (WGS) provides comprehen- 
sive collection of genetic variations and thus is promising 
in discovering novel inheritable factors for both Mendelian 
and complex traits. Two data properties distinguish WGS 
from microarray-based genome-wide association study 
(GWAS). First, WGS data contain rare causal mutations 
that could have large allelic effect. However, the statistical 
association for such rare variants is weak at population 
level because of small allele frequency [1]; therefore, popu- 
lation-based case-control study, which is commonly 
applied in GWAS, is less powerful for WGS. Second, 
family design is attractive and commonly applied in WGS 
studies. Causal rare variants are likely enriched through 
cotransmission in families. Moreover, pedigree structures 
allow statistical imputation of genotypes without experi- 
mental cost [2]. Additionally, family-based data analyses 
automatically control for population stratification and are 
potentially able to incorporate helpful genetic information 
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on phase, effects of parental origin, cotransmission of var- 
iants, and so on[3]. 

Detection of disease variants can also be facilitated by 
trajectory information on individual changes over time. 
Longitudinal genetic studies enable a close investigation 
of both genetic factors that lead to a disease and envir- 
onmental determinants that modulate the subsequent 
progression of the disease. In WGS, it is important to 
develop powerful methods that accommodate both 
within-family correlation structure and correlation 
among repeated measures. Here we extend a multilevel 
model [4,5] to WGS longitudinal family data, which 
simultaneously accounts for familial and time-series cor- 
relations. The implementation is based on the iterative 
generalized least squares (IGLS) algorithm [6,7], which 
allows conclusions to be drawn about both genetic and 
environmental effects while controlling the complex corre- 
lation structure. We assessed the multilevel model by com- 
paring with the linear mixed-effects (LME) models using 
"dose" genotypes on chromosome 3 and the 200 simulation 
replicates of longitudinal response and covariates provide 
by Genetic Analysis Workshop 18 (GAW18) [8]. 



o 



© 2014 Chen et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons 
DEmIUIa«I r~cir^-l-rtal Attribution License (http://creativecommons.0rg/licenses/by/2.O), which permits unrestricted use, distribution, and reproduction in 
DlwlVKSn \_fc:l I Lid I g^iy medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http:// 

creativecommons,org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherv\/ise stated. 



Chen ef al. BMC Proceedings 2014, 8(Suppl 1):S86 
httpy/www.biomedcentral.coni/1753-6561/8/S1/S86 



Page 2 of 5 



Methods 

Method 1: LME model 

Linear mixed-effects models offer a natural approach to 

deal with correlation structures among observations. For 
longitudinal family data, we can define an LME model: 

Y,jk = AjkP + AjkYk + (^ijk, (1) 

where Yijk is response of the rth repeated measure of 
the /th individual in the kXh family, where i = 1, ■ • ■ , rijk, 
j = I, mfe, and k = 1, . . . ,K, with "jfe being the num- 
ber of measures for individual /' in family k and mk 
being the number of individuals in family k; is a 
covariate vector (including genotype) for fixed effects 
zp. Zyfe is a covariate vector for random effects Yk, 
where yk '■= ivik ■ ■ ■ Ymuk)' ~ N(0, Du), Dfe the covariance 
matrix among individuals in family k (e.g., the kinship 
matrix). Also, jk ■= Ujk ■ ■ ■ njkjk) ~ N(0, Hjk), where is 
the covariance matrix among the repeated measures 
for individual / in family k. We assume Yk and fjfe are 
independent between each other and among them- 
selves for all / and k. To implement the LME model, 
we applied the following R package: 

GWAF: R package GWAF was design for genome-wide 
analysis for family data [9]. It accounts for the pedigree 
correlation structure by kinship matrix. However, it does 
not handle longitudinal repeated measures. So this 
method was used to represent the cross-sectional analysis 
for family data and was compared with other family-data 
analysis incorporating longitudinal information. 

Lmekin: R function Imekin in package coxme [10] was 
applied to account for both the family correlation struc- 
ture and the correlation structure of the longitudinal 
repeated measures. Specifically, we set the model that 
includes a random intercept at individual level to account 
for the correlation of repeated measures assuming com- 
pound symmetry structure, a random intercept at family 
level to account for the clustering effect among family 
members. Furthermore, the kinship matrix was incorpo- 
rated through its varlist option to account for the kinship 
correlation among family members. 

Method 2: Multi-level model 

We extend the classic multi-level model [4,5,11] to analyze 
WGS family data with longitudinal repeated measures. 
The response for the ith measure (level 1) of the yth indivi- 
dual (level 2) in the kth family (level 3) can be written as 

Yijk = Xijk'P +Uk+ gjk + Vij + Cijk, (2) 

where Xijk and are similarly defined in (1). The rest 
random-effect terms on the right side of the equation 
are normal distributed with mean zero and variance 
characterizing the correlation structure among 



observations. Denote the response vector y = {yijk)- We 
have y ~ N[xf}, V), where 

Var {y) = V = Acx^ + Ba^ + Ca^ + la^ . (3) 

The first random term Wfe characterizes the clustering 
effects at the family and individual levels. Specifically, 
A = ®k{Jk <8) /*). where /t is a matrix of I's with dimen- 
sion being the size of ^h family, /* is a matrix of I's 
with dimension being the number of repeated measures 
per individual. ® denotes the matrix direct sum, and ® 
denotes the Kronecker product. The second random 
term Sjk indicates the genetic correlation (kinship coeffi- 
cients) among individuals in the Ath family. Mathemati- 
cally, B = ©fe(Dfe (g) /*), where Dfe is the kinship matrix. 
The third random term ^ij indicates the correlation 
among repeated measures in the /th individual: 
C = ©fe(/fe ® R), where Ik is an identity matrix with 
dimension being the size of the ^h family and, R is the 
correlation matrix among repeated individuals. For 
example, if we assume compound symmetry structure, 
for three repeated measures, 

/lp/9\ /100\ /011\ 
J? = p 1 p U 0 1 0 + 1 0 1 p. (4) 

Vppi/ Vool/ \iio/ 

So the term can be decomposed as Ca^ = Cid^ + Cj^pa^, 
such that the matrixes are all known and the parameters 
can be estimated as described below. Certainly, more com- 
plicated correlation structure can be modeled by a further 
decomposition according to the number of covariance 
parameters to be estimated. Finally, ^k is the independent 
and identically distributed error term, and / is the identity 
matrix for all observations. 

For the inference of the multilevel model, the IGLS 
algorithm [6,7] is applied. Let y = y — '^fi. Note that 

E (yy') = V = Aal + Ba^ + Cia^ + Crpu^ + la^. (5) 

Step 1: Given p, estimate V by the least squares estimation 

of variance [12]. Specifically, this is a procedure of fitting 
regression model of response vector y* = vec (yy') to the 
design matrix X* = [vec (A) , vec (B) ,vec(Ci), vec (C2) , vec{I) ], 
where vec (A) denotes the vectorization of the upper trian- 
gular part of matrix A- So, 

(ai,ij,&^,p^i,i^"^ = (X' (V ' (3 V"')X')"'X*' (V"' ® V"')y* and p . (6) 

Step 2: Given V, estimate by the weighed least 
squares estimate: 

P = {x!V-^xY^x!V-^y. (7) 

The estimation procedure starts at an arbitrary /J (e.g., 
obtained from a multiple regression fitting) and then 
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iterates between steps 1 and 2 until convergence. 
Because the IGLS estimate is equivalent to the 
restricted maximal likelihood estimate [4], we can 
apply a Z-test to calculate j?-values for the elements in 
which contains the fixed genetic effects. In particu- 
lar, because Var (^fi^ = {x'V^^x)^^, the Z-test statistic 

for Pj is Zj = Pj/Var (^pj , and the two-tailed jj-value is 

= Pr(|N(0, 1)1 > |Z,|). Certainly, this multilevel 
model has the potential to be further extended to 
incorporate a more complicated covariance structure 
for more sophisticated modeling. 

Results 

For evaluating the methods, we used the "dose" geno- 
type data of the 169 true single-nucleotide variants 
(S^A'^s) on chromosome 3 that were associated with dia- 
stolic blood pressure (DBP) in 200 simulation replicates. 
These data contain 849 individuals in 20 families, and 
the number of individuals in families is 21 to 74, with 
the mean 42.45 and the median 36.5. Kinship matrixes 
of these families were directly calculated based on the 
pedigree information. The above models were fitted 
with or without covariates: age, blood pressure medicine 
status, and sex. For GWAF, which does not analyze 
longitudinal data, we applied the DBP at the first time 
point as the response. For Imekin and multilevel model, 
we applied all three longitudinal repeated measures. The 
knowledge of the true SNVs was only used for evaluat- 
ing the power of these association tests, not for the data 
analysis strategy. 

First, we evaluated the type I error rate control for 
these methods. Fitting the 169 DBP-related SNVs on 
chromosome 3 to Ql, a null response provided by 



GAW18 "to facilitate assessment of type I error," we 
plotted in Figure la the false-positive rates over a variety 
of p-value cutoffs. It is clear that the type I error rate of 
Imekin is highly inflated, and the type I error rates of 
multilevel model and GWAF are closer to the expected 
level around the diagonal line. The inflation is worse 
when covariates are contained in the models (denoted 
"covar"). We also studied the type I error rate through 
permutation. Figure lb shows the false-positive rates for 
fitting the permuted genotype data of these SNVs to DBP 
response, which retained the relationship between covari- 
ates and DBP but destroyed the association between 
SNVs and DBP. Now both Imekin and our multilevel 
models control the type I error rate perfectly well. To 
explain the puzzle, we checked the GAWIS "answers" 
and found that Ql was simulated as a quantitative trait 
correlated among family members with heritability 0.68, 
but the total heritability for DBP is only 0.317. This 
means that Ql values have stronger correlation than 
DBP values do. The inflation of the type I error of Imekin 
indicates that this LME model is less capable than our 
multilevel model in accounting for the correlation among 
individuals (cf [13]). 

We studied the power of detecting the 169 DBP-related 
SNVs on chromosome 3. Based on the phenotype data in 
the simulation replicate 1, Figure Ic shows the true posi- 
tive rate of detecting these true SNVs over a variety of 
p-val\ie cutoffs. In general, the power of detecting true 
SNVs is low at small or moderate p-vahies. This phenom- 
enon indicates that the sample size is still relatively too 
small to detect a large proportion of the weak genetic 
effects simulated in the data. At the same time, longitudi- 
nal methods (Imekin and multilevel models) are better 
than the one-time-point model (GWAF); the latter does 
not have much power except for the strongest SNVs. The 
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Figure 1 Type I error and power for detecting DBP-related. single-nucleotide variants (SNVs) on chromosome 3. Considering all 169 diastolic 
blood pressure (DBP)-related SNVs on chromosome 3, the type I error rates were estimated by the false positive rates when Ql was the null 
response (a) and when the genotypes are permuted (b); the power was estimated by the true positive rate when DBP was the response (c). A 
model with or without containing covariates (age, blood pressure medicine status, and sex) is denoted by its name with or without "covar". 
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Imekin and the multilevel models have similar perfor- 
mance overall, but the multilevel model is better at the 
region of relatively small j?-values (e.g., />-value < 0.1) that 
are of practical interest. For both Imekin and multilevel 
models, there is no big difference between the models 
with and without covariates. We also studied the power of 
detecting specific SNVs by using the data of 200 simula- 
tion replicates. For example, by the multilevel model with 
covariates, the strongest SNV at location 48040283 always 
got significant /(-values from 1.8 x 10^^ to 3.09 x 10^. 

Discussion 

In this work, our main focus was to see whether model- 
ing longitudinal data could provide helpful information 
to increase the power of detecting true SNVs when com- 
pared with the methods for analyzing data at one time 
point. Here we directly apphed the original genotype data 
into modeling and illustrated that the longitudinal 
repeated observations were indeed helpful to detect DBP- 
related genetic factors. However, many true SNVs are 
rare variants, some of which could have big allelic effect 
for specific individuals when the disease mutation pre- 
sents. As a result of small minor allele frequency (MAF), 
the association between such rare variants and their cor- 
responding phenotypes is still weak at the population 
level [1]. This may be one of the main reasons why the 
overall power is low in detecting the majority of the cau- 
sal or regulatory genetic factors. Various strategies of 
rare-variant collapsing procedures [14,15] could be 
applied to grouping and combining genotypes of rare var- 
iants, which has potential to further increase the power. 

The computational speed of the multilevel model is 
comparable with the linear mixed-effects model estima- 
tion by Imekin. Both models are computationally 
demanding (e.g., -10 minutes for our implementation of 
multilevel model and 8 minutes for Imekin to process 
one SNV on a MacBook Pro with 2.9-GHz Intel Core 
i7). However, we observed that the convergence speed 
of the iterative generalized least squares algorithm for 
the multilevel model is relatively fast: the results usually 
do not change much after two iterations. So, restricting 
the number of iterations could potentially reduce com- 
putational time. Further study on improving computa- 
tion efficiency will be carried out in the near future. 

Conclusions 

We developed a multilevel model for fitting family-based 
genotype data and repeated measures of covariates to 
quantitative longitudinal response, which accounts for 
correlations among individuals, nesting effects at the 
family and individual levels, and the time series correla- 
tions due to the repeated measures of covariates and 
responses. Using the simulated data of GAW18, this 
method showed more accurate type I error control than 



the LME model by Imekin, which is likely the result of 
better account for correlations among individuals. The 
multilevel model also provided higher power at small p- 
value cutoffs. At the same time, both Imekin and multi- 
level model, which use the longitudinal information, 
have higher power than GWAF, which only models data 
at one time point. 
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